Abstract

A time series model of CDS sequences in complete genomes is proposed. A map from
DNA sequences to integer sequences is given. The correlation dimensions and Hurst
exponents of CDS sequences in complete genomes of bacteria are calculated. Using
the average of the correlation dimensions, some interesting results are obtained.
1 Introduction

In the past decade or so there has been a groundswell of interest in unraveling the
mysteries of DNA. With improvements in gene cloning and sequencing techniques, DNA
sequence databases have grown rapidly. Analyzing DNA sequences by experimental methods
alone cannot keep pace with this growth, so it has become very important to develop
new theoretical methods. One approach that has, in just a few years, proven to be
particularly fruitful in this regard is the statistical analysis of DNA sequences [1-9]
using modern statistical measures, including work on the correlation properties of
coding and noncoding DNA sequences. A second approach is linguistic: DNA sequences can
be seen as analogous, at a number of levels, to the mechanisms for processing other
kinds of languages, such as natural languages and computer languages. A third approach
uses nonlinear scaling methods, such as fractal dimension and complexity. However,
DNA sequences are more complicated than these types of analysis can describe.
Therefore, it is crucial to develop new tools for analysis with a view toward uncovering
mechanisms used to code other types of information.

Footnote: This is the permanent corresponding address of the first author, e-mail: yuzg@hotmail.com

Since the first complete genome of a free-living bacterium, Mycoplasma genitalium, was
sequenced in 1995, an ever-growing number of complete genomes have been deposited
in public databases. The availability of complete genomes makes it possible to ask
global questions about these sequences. Our group has also discussed the avoided and
under-represented strings in some bacterial complete genomes. In this paper,
we propose a new model for DNA sequences, namely the time series model. First, we
compute the correlation dimension and Hurst exponent of each CDS sequence in a
complete genome; we then consider the distribution of these two quantities over the
complete genomes of bacteria, which is a global problem. Finally, we discuss the
classification of bacteria using our results.

For the present purpose, a DNA sequence may be regarded as a sequence over the
alphabet {A, C, G, T} representing the four bases from which DNA is assembled, namely
adenine, cytosine, guanine, and thymine. For a DNA sequence, we define a map $f$ as
follows:

$$ f = \begin{cases} A \to -2 \\ C \to -1 \\ G \to 1 \\ T \to 2. \end{cases} \qquad (1) $$

Then we obtain a data sequence $\{x_k : k = 1, 2, \ldots, N\}$, where $x_k \in \{-2, -1, 1, 2\}$.
We formally view this sequence as a time series. By the definition of $f$, the four
bases {A, C, G, T} are mapped to four distinct values. One can also use {-2, -1, 1, 2} to
replace {A, G, C, T}, or other orders of A, G, C, T. Our main aim is to distinguish A from G
among the purines, and C from T among the pyrimidines. We expect this to reveal more
information than the one-dimensional DNA walk.
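
The map of Eq. (1) can be illustrated with a short sketch. The names `BASE_TO_INT` and `dna_to_series` are hypothetical and chosen only for this example; upper-casing the input and rejecting symbols outside {A, C, G, T} are assumptions, since the text does not say how such symbols are treated.

```python
# Illustrative sketch of the map f in Eq. (1); all names here are hypothetical.
BASE_TO_INT = {"A": -2, "C": -1, "G": 1, "T": 2}


def dna_to_series(seq):
    """Map a DNA string over {A, C, G, T} to the integer series x_1, ..., x_N."""
    try:
        return [BASE_TO_INT[base] for base in seq.upper()]
    except KeyError as err:
        # Assumption: any symbol outside the four bases is treated as an error.
        raise ValueError(f"unexpected symbol in DNA sequence: {err.args[0]!r}") from None


series = dna_to_series("ATGCGT")
print(series)  # [-2, 2, 1, -1, 1, 2]
```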



2 Correlation dimension and Hurst exponent 

The notion of correlation dimension, introduced by Grassberger and Procaccia,
is well suited to experimental situations in which only a single time series is
available; it is now widely used in many branches of physical science. Given a
sequence of data from a computer or laboratory experiment

$$ x_1, x_2, x_3, \ldots, x_N \qquad (2) $$
where $N$ is a sufficiently large number. These numbers are usually sampled at equal time
intervals $\Delta t$. We embed the time series into $R^m$ and choose a time delay
$\tau = p\Delta t$; we then obtain

$$ y_i = (x_i, x_{i+p}, x_{i+2p}, \ldots, x_{i+(m-1)p}), \quad i = 1, 2, \ldots, N_m, \qquad (3) $$

where

$$ N_m = N - (m-1)p. \qquad (4) $$

In this way we get $N_m$ vectors in the embedding space $R^m$.
For any $y_i, y_j$, we define the distance as

$$ r_{ij} = d(y_i, y_j) = \sum_{l=0}^{m-1} |x_{i+lp} - x_{j+lp}|. \qquad (5) $$

If the distance is less than a preset number $r$, we say that these two vectors are correlated.
The correlation integral is defined as

$$ C_m(r) = \frac{1}{N_m^2} \sum_{i,j=1}^{N_m} H(r - r_{ij}), \qquad (6) $$

where $H$ is the Heaviside function

$$ H(x) = \begin{cases} 1, & \text{if } x > 0, \\ 0, & \text{if } x \le 0. \end{cases} \qquad (7) $$

For a proper choice of $m$ and not too large a value of $r$, it has been shown by Grassberger
and Procaccia that the correlation integral $C_m(r)$ behaves like

$$ C_m(r) \propto r^{D_2(m)}. \qquad (8) $$

Thus one can define the correlation dimension as

$$ D_2 = \lim_{m \to \infty} D_2(m) = \lim_{m \to \infty} \lim_{r \to 0} \frac{\ln C_m(r)}{\ln r}. \qquad (9) $$
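
The delay embedding of Eq. (3) and the correlation integral of Eq. (6) can be sketched as follows. This is a minimal illustration assuming NumPy, not the authors' implementation; the function names `delay_embed` and `correlation_integral` are hypothetical. It builds the full matrix of pairwise distances, so it is only practical for short series.

```python
import numpy as np


def delay_embed(x, m, p=1):
    """Return the N_m = N - (m - 1) * p delay vectors y_i of Eq. (3) as rows."""
    x = np.asarray(x, dtype=float)
    n_m = len(x) - (m - 1) * p
    if n_m < 1:
        raise ValueError("series too short for this m and p")
    # Column l holds x_{i + l p} for i = 1..N_m.
    return np.stack([x[l * p : l * p + n_m] for l in range(m)], axis=1)


def correlation_integral(x, m, r, p=1):
    """C_m(r) of Eq. (6): fraction of ordered pairs (i, j) with r_ij < r."""
    y = delay_embed(x, m, p)
    # r_ij of Eq. (5): sum of absolute coordinate differences.
    dist = np.abs(y[:, None, :] - y[None, :, :]).sum(axis=2)
    # H(r - r_ij) = 1 exactly when r_ij < r; pairs with i = j are included,
    # as in the double sum of Eq. (6).
    return np.count_nonzero(dist < r) / len(y) ** 2
```

Evaluating `correlation_integral` on an increasing sequence of radii gives the points of the $\ln r$ versus $\ln C_m(r)$ plot from which $D_2(m)$ is read off.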



For more details on $D_2$, the reader can refer to the cited literature.

To deal with practical problems, one usually chooses $p = 1$. Following page 346 of the
cited reference, if we choose a sequence $\{r_i : 1 \le i \le n\}$ such that
$r_1 < r_2 < r_3 < \cdots < r_n$, then in the $\ln r$ versus $\ln C_m(r)$ plane we can look
for a scaling region. The slope of the scaling region is $D_2(m)$. When $D_2(m)$ no longer
changes as $m$ increases, we can take this $D_2(m_0)$ as the estimate of $D_2$. We calculate
the correlation dimension of some DNA sequences using the method introduced above. From the
$\ln r$ versus $\ln C_m(r)$ plots of these sequences for different values of the embedding
dimension $m$, we find that it is suitable to choose $m = 7$. For example, we give the
$\ln r$ versus $\ln C_m(r)$ plot of a phage 5'UTR sequence for $m = 7, 8$ (Figure 1). We take
the region from the third point to the 17th point (from left to right) as the scaling region.

Figure 1: $\ln r$ versus $\ln C_m(r)$ plot of the phage 5'UTR sequence for $m = 7, 8$.
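
Reading $D_2(m)$ off the scaling region amounts to a straight-line fit in log-log coordinates. The sketch below assumes NumPy and that the radii and the corresponding values of $C_m(r)$ are already available and strictly positive; the function name `scaling_slope` and its parameters are hypothetical, with defaults that mirror the third-to-17th-point window mentioned above.

```python
import numpy as np


def scaling_slope(r_values, c_values, first=3, last=17):
    """Least-squares slope of ln C_m(r) against ln r over points first..last.

    Points are counted from 1, in order of increasing r, and both ends are
    included. All c_values in the window must be positive.
    """
    ln_r = np.log(np.asarray(r_values, dtype=float))
    ln_c = np.log(np.asarray(c_values, dtype=float))
    window = slice(first - 1, last)
    slope, _intercept = np.polyfit(ln_r[window], ln_c[window], 1)
    return slope
```

In practice the window should be chosen by inspecting the plot, since the extent of the scaling region depends on the sequence and on $m$.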

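To study time series, Hurst invented a new statistical method, the rescaled range
analysis (R/S analysis); later, B. B. Mandelbrot and J. Feder carried R/S analysis
over into fractal analysis. For any time series $x = \{x_k\}_{k=1}^{N}$ and any
$2 \le n \le N$, one can define

$$ \langle x \rangle_n = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad (10) $$

$$ X(i, n) = \sum_{u=1}^{i} [x_u - \langle x \rangle_n], \qquad (11) $$

$$ R(n) = \max_{1 \le i \le n} X(i, n) - \min_{1 \le i \le n} X(i, n), \qquad (12) $$

$$ S(n) = \left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \langle x \rangle_n)^2 \right]^{1/2}. \qquad (13) $$

Hurst found that

$$ R(n)/S(n) \sim \left( \frac{n}{2} \right)^{H}, \qquad (14) $$

where $H$ is called the Hurst exponent.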
Figure 2: An example of R/S analysis of a DNA sequence.

As $n$ changes from 2 to $N$, we obtain $N - 1$ points in the $\ln(n)$ versus $\ln(R(n)/S(n))$
plane. We can then calculate the Hurst exponent $H$ of a DNA sequence $s$ using a least-squares
linear fit. As an example, we plot the R/S analysis of an exon segment $s$ of a mouse
DNA sequence (bp 1730 - bp 2650 of the record with Accession AF033620 in GenBank)
in Figure 2.
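
The R/S procedure can be sketched as follows, assuming NumPy. The function names `rescaled_range` and `hurst_exponent` are hypothetical. One detail is an assumption not addressed in the text: when the first $n$ values are all equal, $S(n) = 0$ and the ratio is undefined, so such points are skipped in the fit.

```python
import numpy as np


def rescaled_range(x, n):
    """R(n)/S(n) for the first n points of x, following Eqs. (10)-(13)."""
    head = np.asarray(x[:n], dtype=float)
    deviations = head - head.mean()          # x_u - <x>_n
    cumulative = np.cumsum(deviations)       # X(i, n) for i = 1..n
    r = cumulative.max() - cumulative.min()  # R(n), Eq. (12)
    s = np.sqrt(np.mean(deviations ** 2))    # S(n), Eq. (13)
    return r / s if s > 0 else np.nan


def hurst_exponent(x):
    """Slope of the least-squares line through (ln n, ln R(n)/S(n)), n = 2..N."""
    ns = np.arange(2, len(x) + 1)
    ratios = np.array([rescaled_range(x, n) for n in ns])
    # Assumption: points where R/S is undefined (S(n) = 0) are left out.
    valid = np.isfinite(ratios) & (ratios > 0)
    slope, _intercept = np.polyfit(np.log(ns[valid]), np.log(ratios[valid]), 1)
    return slope
```

Each call to `rescaled_range` costs time proportional to $n$, so the full fit is quadratic in $N$; this is acceptable for sequences of gene length.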

The Hurst exponent is usually used as a measure of complexity. From page 149 of
Ref. [22], the trajectory of the record is a curve with fractal dimension $D = 2 - H$.
Hence a smaller $H$ means a more complex system. When applied to fractional Brownian
motion, if $H > 1/2$ the system is said to be persistent, which means that if the motion
is along one direction during a given time period $t$, then in the succeeding time $t$ it
is more likely to follow the same direction. For a system with $H < 1/2$ the opposite
holds, that is, the system is antipersistent. When $H = 1/2$, the system is Brownian
motion and is random.

3 Data and results

More than 18 bacterial complete genomes are now available in public databases. There
are four Archaebacteria: Archaeoglobus fulgidus (aful), Pyrococcus horikoshii (pyro),
Methanococcus jannaschii (mjan), and Methanobacterium thermoautotrophicum (mthe);
and four Gram-positive Eubacteria: Mycobacterium tuberculosis (mtub), Mycoplasma
pneumoniae (mpneu), Mycoplasma genitalium (mgen), and Bacillus subtilis (bsub). The
others are Gram-negative Eubacteria: one hyperthermophilic bacterium, Aquifex aeolicus
(aquae); six proteobacteria, Rhizobium sp. NGR234 (pNGR234), Escherichia coli (ecoli),
Haemophilus influenzae (hinf), Helicobacter pylori J99 (hpyl99), Helicobacter pylori 26695
(hpyl) and Rickettsia prowazekii (rpxx); two chlamydia, Chlamydia trachomatis (ctra)
and Chlamydia pneumoniae (cpneu); and one cyanobacterium, Synechocystis PCC6803
(synecho).

Table 1: Average of $D_2$ of genes of 18 bacteria.

Average of D_2   Species of bacterium                Category
2.805            Mycoplasma genitalium (mgen)        Gram-positive Eubacteria
2.827            Methanococcus jannaschii (mjan)     Archaebacteria
2.872            Rickettsia prowazekii (rpxx)        Proteobacteria
2.883            Helicobacter pylori 26695 (hpyl)    Proteobacteria
2.908            Helicobacter pylori J99 (hpyl99)    Proteobacteria
2.938            Haemophilus influenzae (hinf)       Proteobacteria
2.940            Mycoplasma pneumoniae (mpneu)       Gram-positive Eubacteria
2.950            Mycobacterium tuberculosis (mtub)   Gram-positive Eubacteria
2.990            Bacillus subtilis (bsub)            Gram-positive Eubacteria
3.011            Aquifex aeolicus (aquae)            Hyperthermophilic bacterium
3.012            Pyrococcus horikoshii (pyro)        Archaebacteria
3.013            M. thermoautotrophicum (mthe)       Archaebacteria
3.016            Archaeoglobus fulgidus (aful)       Archaebacteria
3.020            Chlamydia trachomatis (ctra)        Chlamydia
3.024            Chlamydia pneumoniae (cpneu)        Chlamydia
3.028            Synechocystis PCC6803 (synecho)     Cyanobacteria
3.047            Rhizobium sp. NGR234 (pNGR234)      Proteobacteria
3.060            Escherichia coli (ecoli)            Proteobacteria

For a given bacterium, we first calculate the correlation dimension and Hurst exponent of
each CDS sequence (i.e. coding sequence) in its complete genome (the results are
shown in Figure 3 and Figure 4), and then calculate the averages of these two quantities.
We find that the averages of the Hurst exponents of the 18 bacteria are almost equal
(ranging from 0.538 to 0.590), whereas the differences among the average correlation
dimensions of these bacteria are larger; see Table 1, in which the value of $D_2$
increases from top to bottom.



4 Discussion and conclusions 

Although the existence of the archaebacterial urkingdom has been accepted by many
biologists, the classification of bacteria is still a matter of controversy. The
evolutionary relationship of the three primary kingdoms (i.e. archaebacteria, eubacteria
and eukaryotes) is another crucial problem that remains unresolved.

From Table 1, we can first roughly divide the bacteria into two classes: the average of
$D_2$ in one class is less than 3.0, and that in the other class is greater than 3.0. The
classification of bacteria by the average of $D_2$ almost coincides with the existing
classification of bacteria. The Archaebacteria cluster together, except mjan. The
Gram-positive bacteria cluster together, except mgen. The Chlamydia also cluster together.
The Proteobacteria are divided into two sub-categories: rpxx, hpyl, hpyl99 and hinf belong
to one sub-category; pNGR234 and ecoli belong to the other.
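
The two-class split at 3.0 can be reproduced with a few lines. This is only an illustrative sketch: the averages are transcribed from Table 1, keyed by the abbreviations used in the text, and the names `AVERAGE_D2` and `split_by_threshold` are hypothetical.

```python
# Averages of D_2 transcribed from Table 1 (keys are the text's abbreviations).
AVERAGE_D2 = {
    "mgen": 2.805, "mjan": 2.827, "rpxx": 2.872, "hpyl": 2.883,
    "hpyl99": 2.908, "hinf": 2.938, "mpneu": 2.940, "mtub": 2.950,
    "bsub": 2.990, "aquae": 3.011, "pyro": 3.012, "mthe": 3.013,
    "aful": 3.016, "ctra": 3.020, "cpneu": 3.024, "synecho": 3.028,
    "pNGR234": 3.047, "ecoli": 3.060,
}


def split_by_threshold(averages, threshold=3.0):
    """Return (low, high): names with average below / above the threshold."""
    low = sorted(name for name, value in averages.items() if value < threshold)
    high = sorted(name for name, value in averages.items() if value > threshold)
    return low, high


low, high = split_by_threshold(AVERAGE_D2)
print("below 3.0:", low)
print("above 3.0:", high)
```

With these values each class contains nine genomes; mjan is the only archaebacterium in the lower class.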

A surprising feature shown in Table 1 is that Aquifex aeolicus is closely linked with the
Archaebacteria. We note that Aquifex, like most Archaebacteria, is hyperthermophilic.
It has previously been shown, from the gene comparison of an enzyme needed for the
synthesis of the amino acid tryptophan, that Aquifex has a close relationship with
Archaebacteria. Our result, from the comparison of complete genomes, shows that the
relationship is even more pronounced. Such a strong correlation at the level of the
complete genome between Aquifex and Archaebacteria is not easily accounted for by
lateral transfer and other accidental events.



We also calculate the averages of the correlation dimensions and Hurst exponents of the
genes in all 16 chromosomes of Saccharomyces cerevisiae (yeast); they are 3.018 and 0.579
respectively. From Table 1, one can see that Archaebacteria and Chlamydia are linked more
closely with yeast, which is a eukaryote, than the other categories of bacteria are.
There are several reports (such as Ref. [27]) that, in some RNA and protein species,
archaebacteria are much more similar in sequence to eukaryotes than to eubacteria. Our
present result supports this point of view.

We also randomly generate a sequence of length 3000 consisting of symbols from the
alphabet {A, T, G, C}. Its correlation dimension is 1.02883. From Table 1 and Fig. 3, we
can conclude that all CDS sequences are far from random sequences. Since the Hurst
exponent of a random sequence is 0.5, we can see from Fig. 4 that the correlation
dimension is a better measure than the Hurst exponent when we compare a real DNA
sequence with a random sequence on the alphabet {A, T, G, C}.

In Ref. [11], we found that the Hurst exponent is a good exponent for distinguishing
different functional regions. Here, however, we consider only one kind of functional
region (i.e. they are all genes), so it is reasonable that the average of the Hurst
exponent does not change much. Fortunately, we can now use the average of the
correlation dimension to distinguish different species.
Figure 4: The Hurst exponents of CDS sequences in the complete genomes of 18 bacteria.